Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implementing ak._v2.to_parquet. #1440

Merged
merged 16 commits into from
Apr 26, 2022
Merged

Implementing ak._v2.to_parquet. #1440

merged 16 commits into from
Apr 26, 2022

Conversation

jpivarski
Copy link
Member

No description provided.

@codecov
Copy link

codecov bot commented Apr 22, 2022

Codecov Report

Merging #1440 (2ae5205) into main (edfce38) will increase coverage by 0.08%.
The diff coverage is 63.40%.

Impacted Files Coverage Δ
src/awkward/_v2/_connect/cuda/__init__.py 0.00% <0.00%> (ø)
src/awkward/_v2/_connect/jax/__init__.py 0.00% <0.00%> (ø)
src/awkward/_v2/contents/recordarray.py 82.26% <ø> (+0.76%) ⬆️
src/awkward/_v2/contents/unionarray.py 86.50% <ø> (ø)
src/awkward/_v2/operations/convert/ak_from_cupy.py 50.00% <0.00%> (+23.33%) ⬆️
src/awkward/_v2/operations/convert/ak_from_iter.py 93.75% <ø> (ø)
src/awkward/_v2/operations/convert/ak_from_jax.py 50.00% <0.00%> (-25.00%) ⬇️
src/awkward/_v2/operations/convert/ak_to_cupy.py 33.33% <0.00%> (+23.95%) ⬆️
src/awkward/_v2/operations/convert/ak_to_jax.py 33.33% <0.00%> (-41.67%) ⬇️
src/awkward/_v2/operations/describe/ak_backend.py 10.00% <0.00%> (-2.50%) ⬇️
... and 92 more

@martindurant
Copy link
Contributor

Please see here for the list of options we thought would be important when we talked this through.

@jpivarski
Copy link
Member Author

compression per data type or per leaf column ("path.to.leaf": "zstd" format)

Per leaf column is implemented: all ParquetWriter arguments that accept a dict or list can be selected via a dict of Awkward column selectors. Selectors are strings or iterables of strings that get passed to Form.select_columns (which slice through Forms to make new Forms):

>>> array = ak._v2.Array([[{"x": 1.1, "y": [1], "z": "one"}, {"x": 2.2, "y": [1, 2], "z": "two"}], [], [{"x": 3.3, "y": [1, 2, 3], "z": "three"}]])
>>> print(array.layout.form.select_columns("x"))
{
    "class": "ListOffsetArray",
    "offsets": "i64",
    "content": {
        "class": "RecordArray",
        "contents": {
            "x": "float64"
        }
    }
}
>>> print(array.layout.form.select_columns("y"))
{
    "class": "ListOffsetArray",
    "offsets": "i64",
    "content": {
        "class": "RecordArray",
        "contents": {
            "y": {
                "class": "ListOffsetArray",
                "offsets": "i64",
                "content": "int64"
            }
        }
    }
}
>>> print(array.layout.form.select_columns("z"))
{
    "class": "ListOffsetArray",
    "offsets": "i64",
    "content": {
        "class": "RecordArray",
        "contents": {
            "z": {
                "class": "ListOffsetArray",
                "offsets": "i64",
                "content": {
                    "class": "NumpyArray",
                    "primitive": "uint8",
                    "parameters": {
                        "__array__": "char"
                    }
                },
                "parameters": {
                    "__array__": "string"
                }
            }
        }
    }
}
>>> print(array.layout.form.select_columns(["x", "y"]))
{
    "class": "ListOffsetArray",
    "offsets": "i64",
    "content": {
        "class": "RecordArray",
        "contents": {
            "x": "float64",
            "y": {
                "class": "ListOffsetArray",
                "offsets": "i64",
                "content": "int64"
            }
        }
    }
}

They're wildcard-friendly and don't have the "list.item/list.element" baggage.

Once sliced, column_types gives you Parquet-relevant types for the columns:

>>> array.layout.form.column_types()
(dtype('float64'), dtype('int64'), 'string')

Again, it ignores whatever lists or option-types it encountered on the way down to the leaves, and this "string" means string or bytestring, distinct from any dtypes.

They were to be used here:

https://github.com/scikit-hep/awkward-1.0/blob/6286e29ec769e95ef882f7f4a79eb43212bf0b53/src/awkward/_v2/operations/convert/ak_to_parquet.py#L123-L145

But Arrow is saying that nested column buffers made this way are too short, that the file is possibly corrupted. If I'm not misunderstanding something, this looks like a bug in pyarrow.

byte stream split for floats if compression is not None or lzma

See above.

partitioning

If the data is an iterator without len, then it will iterate over everything in it and make one row-group per item.

https://github.com/scikit-hep/awkward-1.0/blob/6286e29ec769e95ef882f7f4a79eb43212bf0b53/src/awkward/_v2/operations/convert/ak_to_parquet.py#L41-L50

parquet 2 for full set of time and int types
v2 data page (for possible later fastparquet implementation)

https://github.com/scikit-hep/awkward-1.0/blob/6286e29ec769e95ef882f7f4a79eb43212bf0b53/src/awkward/_v2/operations/convert/ak_to_parquet.py#L25-L26

I haven't tried it out yet.

dict encoding always off

That would be use_dictionary=False (though I think you'd want it on some string-valued fields).

https://github.com/scikit-hep/awkward-1.0/blob/6286e29ec769e95ef882f7f4a79eb43212bf0b53/src/awkward/_v2/operations/convert/ak_to_parquet.py#L172

But the apparent bug I mentioned above is making me hesitate about that.

@jpivarski
Copy link
Member Author

This is pretty much done. I think the options might change; in particular, I think it would be much better to be writing compliant list types, but there's an issue with that.

@jpivarski jpivarski marked this pull request as ready for review April 26, 2022 19:57
@jpivarski jpivarski enabled auto-merge (squash) April 26, 2022 19:57
@martindurant
Copy link
Contributor

If the data is an iterator without len, then it will iterate over everything in it and make one row-group per item.

I meant directory name partitioning, ie., following group-by.

Per leaf column compression is implemented

(I forgot to mention that we discussed what decent defaults might be, per data type)

I know nothing about the possible pyarrow bug.

@jpivarski jpivarski merged commit c1ffcce into main Apr 26, 2022
@jpivarski jpivarski deleted the jpivarski/start-v2-to_parquet branch April 26, 2022 20:32
@jpivarski
Copy link
Member Author

The variable named row_group_number is not strictly right because the writer might write more than one row group in a call. It should be named iteration_number.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants